[57] ABSTRACT

The present invention relates to a computer method, apparatus and programmed medium for clustering databases containing data with categorical attributes. The present invention assigns a pair of points to be neighbors if their similarity exceeds a certain threshold. The similarity value for pairs of points can be based on non-metric information. The present invention determines a total number of links between each cluster and every other cluster based upon the neighbors of the clusters. A goodness measure between each cluster and every other cluster, based upon the total number of links between each cluster and every other cluster and the total number of points within each cluster and every other cluster, is then calculated. The present invention merges the two clusters with the best goodness measure. Thus, clustering is performed accurately and efficiently by merging data based on the number of links between the data to be clustered.

57 Claims, 6 Drawing Sheets 
METHOD, APPARATUS AND PROGRAMMED MEDIUM FOR CLUSTERING DATABASES WITH CATEGORICAL ATTRIBUTES

FIELD OF THE INVENTION

This invention relates to the art of data mining, and in particular to method, apparatus and programmed medium for clustering databases with categorical attributes by using links to measure the similarity and proximity between data points and robustly merging clusters based on these links.

BACKGROUND OF THE INVENTION

The problem of data mining or knowledge discovery is becoming increasingly important in recent years. There is an enormous wealth of information embedded in large corporate databases and data warehouses maintained by retailers, telecom service providers and credit card companies that contain information related to customer purchases and customer calls. Corporations could benefit immensely in the areas of marketing, advertising and sales if interesting and previously unknown customer buying and calling patterns could be discovered from the large volumes of data.

Clustering is a useful technique for grouping data points such that points within a single group/cluster have similar characteristics, while points in different groups are dissimilar. For example, consider a market basket database containing one transaction per customer with each transaction containing the set of items purchased by the customer. The transaction data can be used to cluster the customers such that customers with similar buying patterns are in a single cluster.

For example, one cluster may consist of predominantly married customers with infants that buy diapers, baby food, toys, etc. (in addition to necessities like milk, sugar, butter, etc.), while another may consist of high-income customers that buy imported products like French and Italian wine, Swiss cheese and Belgian chocolate. The clusters can then be used to characterize the different customer groups, and these characterizations can be used in targeted marketing and advertising such that specific products are directed toward specific customer groups.

The characterization can also be used to predict buying patterns of new customers based on their profiles. For example, it may be possible to conclude that high-income customers buy imported foods, and then mail customized catalogs for imported foods to these high-income customers.

The above market basket database, containing customer transactions, is an example of data points with attributes that are non-numeric. Transactions in the database can be viewed as records with boolean attributes, each attribute corresponding to a single item. Further, in a record for a transaction, the attribute corresponding to an item is "true" if and only if the transaction contains the item; otherwise, it is "false." Boolean attributes themselves are a special case of categorical attributes.

The domain of categorical attributes is not limited to simply true and false values, but could be any arbitrary finite set of values. An example of a categorical attribute is color whose domain includes values such as brown, black, white, etc. A proper method of clustering in the presence of such categorical attributes is desired.

The problem of clustering can be defined as follows: given n data points in a d-dimensional space, partition the data points into k clusters such that the data points in a cluster are more similar to each other than data points in different clusters. Existing clustering methods can be broadly classified into partitional and hierarchical methods. Partitional clustering attempts to determine k partitions that optimize a certain criterion function. The most commonly used criterion function is:

$$E = \sum_{i=1}^{k} \sum_{x \in C_i} d(x, m_i)$$

In the above equation, $m_i$ is the centroid of cluster $C_i$ while $d(x, m_i)$ is the Euclidean distance between $x$ and $m_i$. The Euclidean distance between two points $(x_1, x_2, \ldots, x_d)$ and $(y_1, y_2, \ldots, y_d)$ is $\left(\sum_{i=1}^{d} (x_i - y_i)^2\right)^{1/2}$. Thus the criterion function $E$ attempts to minimize the distance of every point from the mean of the cluster to which the point belongs. A common approach is to minimize the criterion function using a hill-climbing technique. That is, starting with k initial partitions, data points are moved from one cluster to another to improve the value of the criterion function.

While the above criterion function could yield satisfactory results for numeric attributes, it is not appropriate for data sets with categorical attributes. For example, consider the above market basket database. Typically, the number of items and the number of attributes in such a database are very large (a few thousand) while the size of an average transaction is much smaller (less than a hundred). Furthermore, customers with similar buying patterns and, therefore, belonging to the same cluster, may buy a small subset of items from a much larger set that defines the cluster.

[illegible] items like French wine, Swiss cheese, Italian pasta [illegible] transaction within the cluster [illegible] possible that a pair of transactions in a cluster will have a [illegible] other transactions within the cluster (having a substantial number of items in common with the two transactions).

In addition, the set of items defining the clusters may not have uniform sizes. A cluster including common items such as diapers, baby food and toys, for example, will typically involve a large number of items and customer transactions, while the cluster defined by imported products will be much smaller. In the larger cluster, since the transactions are spread out over a larger number of items, most transaction pairs will have few items in common. Consequently, a smaller percentage of these transaction pairs will have a large number of items in common. Thus, the distances of the transactions from the mean of the larger cluster will be much larger.

Since the criterion function of the partitional clustering method is defined in terms of distance from the mean of a cluster, splitting large clusters generally occurs. By splitting larger clusters the distance between a transaction and the mean of the cluster is reduced and, accordingly, the criterion function is also reduced. Therefore, the partitional clustering method favors splitting large clusters. This, however, is not desirable since the large cluster is split even though transactions in the cluster are well connected and strongly linked.

Hierarchical clustering is a sequence of partitions in which each partition is nested into the next partition in the sequence. Current hierarchical clustering methods, however, are unsuitable for clustering data sets containing categorical
attributes. For instance, consider a centroid-based hierarchical clustering method in which, initially, each point is treated as a separate cluster. Pairs of clusters whose centroids or means are the closest are then successively merged until the desired number of clusters remain. For categorical attributes, however, distances between centroids of clusters are a poor estimate of the similarity between them, as is illustrated by the following example.

Consider a market basket database containing the following 4 transactions concerning items 1, 2, 3, 4, 5 and 6: (a) {1, 2, 3, 5}, (b) {2, 3, 4, 5}, (c) {1, 4}, (d) {6}. The transactions can be viewed as points with boolean attributes (where 0 indicates that an item is missing while 1 indicates that an item is present in the transaction) corresponding to the items 1, 2, 3, 4, 5 and 6. The four points (a, b, c, d) thus become (1,1,1,0,1,0), (0,1,1,1,1,0), (1,0,0,1,0,0) and (0,0,0,0,0,1). Using the Euclidean distance metric to measure the closeness between points/clusters, the distance between the first two points (a and b) is $\sqrt{2}$, which is the smallest distance between any pair of the four points. As a result, points a and b are merged by the centroid-based hierarchical approach. The centroid of the new merged cluster is (0.5,1,1,0.5,1,0). In the next step, the third and fourth points (c and d) are merged since the distance between them is $\sqrt{3}$, which is less than the distance between the centroid of the merged cluster and points c or d ($\sqrt{3.5}$ and $\sqrt{4.5}$, respectively). However, this leads to merging transactions {1, 4} and {6} that do not have a single item in common. Thus, using distances between the centroids of clusters while making decisions about the clusters to merge could cause points belonging to different clusters to be assigned to the same cluster.
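
The arithmetic in this example can be checked with a short sketch. It is only an illustration of the four-transaction example above, not part of the described method; the helper names `euclidean` and `centroid` are assumptions introduced here.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(column) / len(points) for column in zip(*points))

# Boolean encodings of transactions (a)-(d) over items 1..6, as in the text.
a = (1, 1, 1, 0, 1, 0)
b = (0, 1, 1, 1, 1, 0)
c = (1, 0, 0, 1, 0, 0)
d = (0, 0, 0, 0, 0, 1)

dist_ab = euclidean(a, b)                 # a and b differ on items 1 and 4
merged_ab = centroid([a, b])              # centroid after merging a and b
dist_cd = euclidean(c, d)                 # c and d share no item
dist_merged_c = euclidean(merged_ab, c)
dist_merged_d = euclidean(merged_ab, d)

# c and d are closer to each other than either is to the merged centroid,
# so a centroid-based method merges them next despite having no common item.
merge_c_with_d = dist_cd < min(dist_merged_c, dist_merged_d)
```

Here `merge_c_with_d` evaluates to true, which is the undesirable merge of {1, 4} with {6} that the passage describes.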

Once points belonging to different clusters are merged, the situation gets progressively worse as the clustering progresses. What typically happens is a ripple effect, that is, as the cluster size grows, the number of attributes appearing in the mean goes up, and their value in the mean decreases. This makes it very difficult to distinguish the difference between two points that differ on few attributes, or two points that differ on every attribute by small amounts. The following example will make this issue very clear.

Consider the means of two clusters (1/3, 1/3, 1/3, 0, 0, 0) and (0, 0, 0, 1/3, 1/3, 1/3) with roughly the same number of points. Even though the two clusters have no attributes in common, the Euclidean distance between their means is less than the distance of a point (1, 1, 1, 0, 0, 0) to the mean of the first cluster, even though the point has items in common with the first cluster. This is undesirable since the point shares common attributes with the first cluster. A method based on distance will merge the two clusters and will generate a new cluster with mean (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).

An interesting side effect of this merger is that the 
distance of the point (1, 1, 1, 0, 0, 0) to the new cluster is 
even larger than the original distance of the point to the first 
of the merged clusters. In effect, what is happening is that the 
center of the cluster is spreading over more and more 
attributes. As this tendency starts, the cluster center becomes 
closer to other centers which also span a large number of 
attributes. Thus, these centers tend to spread across all 
attributes and lose the information about the points in the 
cluster that they represent. 
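
The spreading effect described above can be reproduced numerically. The sketch below is illustrative only; it assumes the two clusters have exactly equal sizes, so the merged mean is the plain average of the two means, and the helper name `euclidean` is an assumption introduced here.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

mean_1 = (1/3, 1/3, 1/3, 0, 0, 0)
mean_2 = (0, 0, 0, 1/3, 1/3, 1/3)
point = (1, 1, 1, 0, 0, 0)

between_means = euclidean(mean_1, mean_2)    # sqrt(6 * (1/3)^2) = sqrt(2/3)
point_to_mean_1 = euclidean(point, mean_1)   # sqrt(3 * (2/3)^2) = sqrt(4/3)

# Assumption: equal cluster sizes, so the merged mean is the average.
merged_mean = tuple((x + y) / 2 for x, y in zip(mean_1, mean_2))
point_to_merged = euclidean(point, merged_mean)

# The two disjoint clusters look closer to each other than the point looks
# to its own cluster, and merging pushes the point even further away.
ordering_holds = between_means < point_to_mean_1 < point_to_merged
```

Under these assumptions the three distances are roughly 0.82, 1.15 and 1.47, in that order.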

"Set theoretic" similarity measures such as the Jaccard 
coefficient have often been used, instead of the Euclidean 
distance, for clustering data contained within databases. The
Jaccard coefficient for similarity between transactions $T_1$
and $T_2$ is

$$\frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}.$$

With the Jaccard coefficient as the distance measure between
clusters, centroid-based hierarchical clustering schemes cannot
be used since the similarity measure is non-metric, and
defined for only points in the cluster and not for its centroid.
Therefore, a minimum spanning tree (MST) hierarchical 
clustering method or a hierarchical clustering method using 
a group average technique must also be employed. The MST 
method merges, at each step, the pair of clusters containing 
the most similar pair of points while the group average 
method merges the ones for which the average similarity 
between pairs of points is the highest. 

The MST method is known to be very sensitive to outliers 
(an outlier is a noise impulse which is locally inconsistent 
with the rest of the data) while the group average method has 
a tendency to split large clusters (since the average similarity 
between two subclusters of a large cluster is small). 
Furthermore, the Jaccard coefficient is a measure of the 
similarity between only the two points in question. Thus it 
does not reflect the properties of the neighborhood of the 
points. Consequently, the Jaccard coefficient fails to capture 
the natural clustering of data sets forming clusters with 
categorical attributes that are close to each other as illus- 
trated in the following example. 

FIG. 1 illustrates two transaction clusters 10, 12 of a 
market basket database containing items 1, 2, ... 8, 9. The 
first cluster 10 is defined by 5 items while the second cluster 
12 is defined by 4 items. These items are shown at the top 
of the two clusters 10, 12. Note that items 1 and 2 are 
common to both clusters 10, 12. Each cluster 10, 12 contains 
transactions of size 3, one for every subset of the set of items 
that define the clusters 10, 12.

The Jaccard coefficient between an arbitrary pair of transactions belonging to the first cluster 10 ranges from 0.2 (e.g., {1, 2, 3} and {3, 4, 5}) to 0.5 (e.g., {1, 2, 3} and {1, 2, 4}). Note that even though {1, 2, 3} and {1, 2, 7} share two common items and have a high coefficient (0.5), they belong to different clusters 10, 12. In contrast, {1, 2, 3} and {3, 4, 5} have a lower coefficient (0.2), but belong to the same cluster 10.
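
The coefficients quoted above can be reproduced with a small sketch. It assumes, from the example transactions in the text, that the first cluster 10 is defined by items 1 through 5; that item set and the function name `jaccard` are assumptions made for illustration.

```python
from itertools import combinations

def jaccard(t1, t2):
    """Jaccard coefficient |T1 & T2| / |T1 | T2| of two item collections."""
    t1, t2 = set(t1), set(t2)
    union = t1 | t2
    return len(t1 & t2) / len(union) if union else 0.0

# Assumed item set for cluster 10; every size-3 subset is a transaction.
cluster_10_items = [1, 2, 3, 4, 5]
cluster_10 = [set(t) for t in combinations(cluster_10_items, 3)]

within = [jaccard(s, t) for s, t in combinations(cluster_10, 2)]
lowest, highest = min(within), max(within)

same_cluster = jaccard({1, 2, 3}, {3, 4, 5})     # both in cluster 10
across_clusters = jaccard({1, 2, 3}, {1, 2, 7})  # clusters 10 and 12
```

The pair drawn from different clusters scores as high as the best pair inside cluster 10, which is why a purely pairwise measure cannot separate the two clusters.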

The MST method would first merge transactions {1, 2, 3} and {1, 2, 7} since the Jaccard coefficient has the maximum value (0.5). Once this merger occurs, the new cluster may subsequently merge with transactions from both clusters 10, 12, for example, {1, 3, 4} or {1, 6, 7}, since these are very similar to transactions in the new cluster. This is an expected result since it is well known that the MST method is fragile when clusters are not well-separated.

The use of a group average technique for merging clusters 
ameliorates some of the problems with the MST method. 
However, it may still fail to discover the correct clusters. For 
example, similar to the MST method, the group average 
method would first merge a pair of transactions containing 
items 1 and 2, but belonging to the different clusters 10, 12. 
The group average of the Jaccard coefficient between the 
new cluster and every other transaction containing both 1 
and 2 is still the maximum (0.5). Consequently, every 
transaction containing both 1 and 2 may get merged together 
into a single cluster in subsequent steps of the group average 
method. Therefore, transactions {1, 2, 3} and {1, 2, 7} from
the two separate clusters 10, 12 may be assigned to the same 
cluster by the time the method completes. 

The problem of clustering related customer transactions in 
a market basket database has recently been addressed by 
using a hypergraph approach. Frequent itemsets used to generate association rules are used to construct a weighted hypergraph. Each itemset is a hyperedge in the weighted hypergraph and the weight of the hyperedge is computed as the average of the confidences for all possible association rules that can be generated from the itemset. Then, a hypergraph partitioning procedure is used to partition the items such that the sum of the weights of hyperedges that are cut due to the partitioning is minimized. The result is a clustering of items (not transactions) that occur together in the transactions. The item clusters are then used as the description of the cluster and a scoring metric is used to assign customer transactions to the best item cluster. For example, a transaction $T$ may be assigned to the item cluster $C_i$ for which the ratio

$$\frac{|T \cap C_i|}{|C_i|}$$

is the highest.

The rationale for using item clusters to cluster transactions is questionable. For example, the approach makes the assumption that itemsets that define the clusters are disjoint and have no overlap among them. This may not be true in practice since transactions in different clusters may have a few common items. For instance, consider the market basket database in the above example (FIG. 1). With minimum support set to 2 transactions, the hypergraph partitioning scheme generates two item clusters of which one is {7} and the other contains the remaining items (since 7 has the least hyperedges to the other items). However, this results in transactions {1, 2, 6} (from cluster 12) and {3, 4, 5} (from cluster 10) being assigned to the same cluster since both have the highest score with respect to the big item cluster.

Accordingly, a clustering method that can correctly identify clusters containing data with categorical attributes, while efficiently handling outliers, is still needed.

SUMMARY OF THE INVENTION

This invention provides a method, apparatus and programmed medium for clustering databases capable of identifying clusters containing data with categorical attributes while efficiently handling outliers.

This invention also provides a method, apparatus and programmed medium for clustering large databases capable of identifying clusters containing data with categorical attributes with reduced time and space complexities.

The invention accomplishes the above and other objects and advantages by providing a method of clustering databases containing data with categorical attributes that uses links to measure the similarity and proximity between the data points and then robustly merges clusters based on these links.

Using links between data points, instead of distance metrics, allows the present invention to overcome the problems of the partitional and Jaccard coefficient clustering schemes. The present invention assigns a pair of points to be neighbors if their similarity exceeds a certain threshold. The similarity value for pairs of points can be based on distances or any other non-metric similarity function obtained from a domain expert/similarity table. The number of links between a pair of points is the number of common neighbors for the points. Points belonging to a single cluster will generally have a large number of common neighbors, and consequently more links. Thus, during clustering, merging clusters and points with the largest number of links first will result in better and more meaningful clusters.

Unlike distances or similarities between a pair of points, which are local properties involving only the two points in question, links incorporate global information about the other points in the neighborhood of the two points being considered. The larger the number of links between a pair of points, the greater the likelihood that they belong to the same cluster. Thus, clustering using links injects global knowledge into the clustering process and is thus more robust. For example, even though a pair of clusters are close to each other [illegible] very few common neighbors and thus very few links.

In another embodiment of the present invention, time and space complexities are reduced by drawing a random sample of data points from the database. Random samples of moderate sizes preserve information about the clusters while reducing execution time and memory requirements, thus enabling the present invention to correctly cluster the input quickly and more efficiently.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other advantages and features of the invention will become more apparent from the detailed description of the preferred embodiments of the invention given below with reference to the accompanying drawings in which:

FIG. 1 is an illustration of a target clustering of a database containing data with categorical attributes;

FIG. 2 is a block diagram illustrating the overall method of a preferred embodiment of the present invention;

FIG. 3 illustrates a block diagram of a computer system for accomplishing the clustering method of FIG. 2;

FIG. 4 illustrates, in flow chart form, the clustering method of a preferred embodiment of the present invention;

FIG. 5 illustrates, in flow chart form, the link computation method of a preferred embodiment of the present invention;

FIG. 6 illustrates, in flow chart form, the process of selecting a random data set in accordance with a preferred embodiment of the present invention;

FIG. 7 illustrates, in flow chart form, the process of providing a cluster label to data in accordance with a preferred embodiment of the present invention; and

FIG. 8 illustrates, in flow chart form, the process of [illegible] outliers in accordance with a preferred embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 2 illustrates, generally, the method steps 100 involved in clustering a database according to various preferred embodiments of the present invention. For example, a first embodiment of the present invention performs method step 106, while the preferred embodiment performs all steps 104 through 108. For overview purposes, all of the steps shown in FIG. 2 are now briefly described. In step 104, the method begins by drawing a random sample of data points from the input data 102. This random sample is then clustered by first generating links between the points and then clustering the links in step 106. Once the clustering of the random sample is completed, the clusters involving only the random points are used to label the remainder of the data set in step 108.

Referring now to FIG. 3, in a preferred embodiment, the clustering method 100 of the present invention is performed
on a programmed general purpose computer system 140.
The computer system 140 includes a central processing unit
(CPU) 142 that communicates with an input/output (I/O)
device 144 over a bus 152. A second I/O device 146 is
illustrated, but is not necessary to practice the method 100 
of the present invention. The computer system 140 also 
includes random access memory (RAM) 148, read only 
memory (ROM) 150, and may include peripheral devices 
such as a floppy disk drive 154 and a compact disk (CD) 
ROM drive 156 that also communicate with the CPU 142 
over the bus 152. It must be noted that the exact architecture 
of the computer system 140 is not important and that any 
combination of computer compatible devices may be incor- 
porated into the system 140 as long as the method 100 of the 
present invention can operate on a system 140 having a CPU 
142, RAM 148 and storage memory as described below. 
Moreover, the program for CPU 142 which causes it to 
implement the invention may be stored in ROM 150, 
CD-ROM 156, floppy disk 158, a hard drive or any other 
medium capable of storing a program. During execution of 
the program it will be loaded into RAM 148. All of these 
devices communicate with CPU 142 as is well known in the 
art. 

The CPU 142 performs logical and mathematical opera- 
tions required by the method 100 of the present invention, 
such as data manipulation and comparisons, as well as other 
arithmetic and logical functions generally understood by 
those of ordinary skill in the art. The RAM 148 is used to 
store the data 102 to be clustered, clustered data and 
program instructions required to implement the method 100 
of the present invention and can be comprised of conven- 
tional random access memory (RAM), bulk storage memory, 
or a combination of both, as generally understood by those 
of ordinary skill in the art. The I/O devices 144, 146 are 
responsible for interfacing with an operator of the computer 
system 140 or with peripheral data devices such as a hard 
drive or other device (not shown) to receive or distribute the 
data 102 as generally understood by those of ordinary skill 
in the art. 

The following definitions are required for a better understanding of the clustering method of the present invention. The method will be described in detail afterwards. Neighbors of a point are those points that are considerably similar to it. Let $\mathrm{sim}(p_i, p_j)$ be a similarity function that is normalized and captures the closeness between the pair of points $p_i$ and $p_j$. The function sim could be one of the well-known distance metrics (e.g., Euclidean), or it could even be non-metric. The function sim assumes values between 0 and 1, with larger values indicating that the points are more similar. Given a threshold $\theta$ between 0 and 1, a pair of points $p_i$ and $p_j$ are defined to be neighbors if:

$$\mathrm{sim}(p_i, p_j) \geq \theta$$

In the above equation, $\theta$ is a user-defined parameter that can be used to control how close a pair of transactions must be in order to be considered neighbors. Thus, higher values of $\theta$ correspond to a higher threshold for the similarity between a pair of points before they are considered neighbors. Assuming that sim is 1 for identical points and 0 for totally dissimilar points, a value of 1 for $\theta$ constrains a point to be a neighbor to only other identical points. On the other hand, a value of 0 for $\theta$ permits any arbitrary pair of points to be neighbors. Depending on the desired closeness, an appropriate value of $\theta$ may be chosen by the user. Preferably, $\theta$ will have a value between 0.5 and 0.8.
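
The neighbor definition can be sketched as follows, assuming the neighbor condition is that the similarity is at least the threshold. The function name `find_neighbors`, the choice to exclude a point from its own neighbor set, and the use of the Jaccard coefficient as the example similarity are all assumptions made for illustration; any normalized similarity function could be passed in.

```python
def jaccard(t1, t2):
    """Example similarity: Jaccard coefficient of two item collections."""
    t1, t2 = set(t1), set(t2)
    union = t1 | t2
    return len(t1 & t2) / len(union) if union else 0.0

def find_neighbors(points, sim, theta):
    """Map each point index to the indices of all other points whose
    similarity to it is at least theta."""
    neighbors = {i: set() for i in range(len(points))}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if sim(points[i], points[j]) >= theta:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors

transactions = [{1, 2, 3}, {1, 2, 4}, {3, 4, 5}, {6}]

at_half = find_neighbors(transactions, jaccard, 0.5)  # moderate threshold
at_zero = find_neighbors(transactions, jaccard, 0.0)  # every pair qualifies
at_one = find_neighbors(transactions, jaccard, 1.0)   # only identical points
```

With a threshold of 0.5 only the first two transactions are neighbors; at 0 every pair is, and at 1 none are because no two transactions are identical.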

For market basket data, the database consists of a set of transactions, each of which is a set of items. A possible definition for $\mathrm{sim}(T_1, T_2)$, the similarity between the two transactions $T_1$ and $T_2$, is the following:

$$\mathrm{sim}(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$$

where $|T_i|$ is the number of items in $T_i$. The more items that the two transactions $T_1$ and $T_2$ have in common, that is, the larger $|T_1 \cap T_2|$ is, the more similar they are. Dividing by $|T_1 \cup T_2|$ is the scaling factor which ensures that $\theta$ is between 0 and 1. Thus, the above equation computes the relative closeness based on the items appearing in both transactions $T_1$ and $T_2$.

The above definition of a neighbor rules out subsets of a transaction that are very small in size. A typical example is that of a store where milk is bought by everyone. A transaction with only milk will not be considered very similar to other bigger transactions that contain milk. Also, note that for a pair of transactions $T_1$ and $T_2$, sim can take at most $\min\{|T_1|, |T_2|\} + 1$ values. Thus, there are at most $\min\{|T_1|, |T_2|\} + 1$ distinct similarity levels between the two transactions. As a result, if most transactions have uniform sizes, then the possible values for sim are greatly reduced, which simplifies the choice of an appropriate value for the parameter $\theta$.
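
The count of similarity levels follows from the fact that, for fixed transaction sizes, the similarity depends only on the size of the intersection. A minimal sketch, assuming the Jaccard-style similarity defined above and non-empty transactions; the function name `similarity_levels` is an assumption introduced here.

```python
from fractions import Fraction

def similarity_levels(size_1, size_2):
    """All values sim can take for transactions of the given non-zero sizes.

    With k common items the union has size_1 + size_2 - k items, so the
    similarity is k / (size_1 + size_2 - k) for k = 0 .. min(size_1, size_2).
    """
    return [
        Fraction(k, size_1 + size_2 - k)
        for k in range(min(size_1, size_2) + 1)
    ]

levels_3_3 = similarity_levels(3, 3)
levels_1_8 = similarity_levels(1, 8)
```

For two transactions of size 3 this yields the four levels 0, 1/5, 1/2 and 1. For a single-item transaction against one of size 8 it yields only 0 and 1/8, which also illustrates why a milk-only transaction never looks similar to a larger one.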

Data sets with categorical attributes are similarly handled.
Categorical data typically is of fixed dimension and is more 
structured than market basket data. However, it is still 
possible that in certain records, values may be missing for 
certain attributes. 

The present invention handles categorical attributes with 
missing values by modeling each record with categorical 
attributes as a transaction. Corresponding to every attribute 
A and value v in its domain, we introduce an item A.v. A 
transaction $T_i$ for a record contains A.v if and only if the
value of attribute A in the record is v. If the value for an 
attribute is missing in the record, then the corresponding 
transaction does not contain items for the attribute. Thus 
missing values are ignored and the aforementioned similar- 
ity function can be used to compute similarities between the 
records by determining the similarity between the corre- 
sponding transactions. 
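
The record-to-transaction modeling described above can be sketched as follows. Representing a record as a dictionary with `None` for a missing value, and the names `record_to_transaction` and `jaccard`, are assumptions made for illustration.

```python
def record_to_transaction(record):
    """Turn a categorical record into a set of 'A.v' items, one per
    attribute A whose value v is present; missing values add no item."""
    return frozenset(
        f"{attribute}.{value}"
        for attribute, value in record.items()
        if value is not None
    )

def jaccard(t1, t2):
    """Similarity |T1 & T2| / |T1 | T2| between two transactions."""
    union = t1 | t2
    return len(t1 & t2) / len(union) if union else 0.0

# Hypothetical records; 'origin' is missing in the first one.
record_1 = {"color": "brown", "size": "large", "origin": None}
record_2 = {"color": "brown", "size": "small", "origin": "France"}

transaction_1 = record_to_transaction(record_1)
transaction_2 = record_to_transaction(record_2)
similarity = jaccard(transaction_1, transaction_2)
```

The two hypothetical records share only the item `color.brown` out of four distinct items, so their similarity is 0.25.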

It must be noted that the above suggested method for 
dealing with missing values is one of several possible ways 
to handle missing values and the present invention should 
not be so limited. For example, when dealing with time 
series data, the following method could be used. For time 
series data, each data point consists of a sequence of time 
slot and value pairs. Each data point can be viewed as a 
record with every time slot corresponding to a single cat- 
egorical attribute. The values that are possible in the time 
slot then constitute the domain of the categorical attribute. 
Missing values for attributes can frequently result since two 
individual time series could be sampled at different times. 
For example, for young mutual funds that began a year ago, 
prices for time periods preceding the last year do not exist. 

In this case, in order to compute the similarity between 
two records, the present invention only considers attributes 
that have values in both records. This way, if two records are 
identical for the attributes that do not contain missing values, 
then it is concluded that the similarity between them is high 
even though one of the records may have a missing value. 
Thus, for a pair of records, the transaction for each record 
only contains items that correspond to attributes for which 
values are not missing in either record. The similarity 
between the transactions can then be computed as described 
above. 
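
The pairwise treatment of missing values for time series can be sketched in the same style. Each record is assumed to be a dictionary from time slot to value, with `None` marking a missing value; the function names and the sample fund data are hypothetical.

```python
def pairwise_transactions(record_1, record_2):
    """Build one transaction per record using only the attributes that
    have a value in both records."""
    shared = [
        slot for slot in record_1
        if record_1[slot] is not None and record_2.get(slot) is not None
    ]
    t1 = frozenset((slot, record_1[slot]) for slot in shared)
    t2 = frozenset((slot, record_2[slot]) for slot in shared)
    return t1, t2

def jaccard(t1, t2):
    """Similarity |T1 & T2| / |T1 | T2| between two transactions."""
    union = t1 | t2
    return len(t1 & t2) / len(union) if union else 0.0

# Hypothetical price movements; the young fund has no data for the first slot.
old_fund = {"slot_1": "up", "slot_2": "down", "slot_3": "up"}
young_fund = {"slot_1": None, "slot_2": "down", "slot_3": "up"}

t_old, t_young = pairwise_transactions(old_fund, young_fund)
similarity = jaccard(t_old, t_young)
```

Because the two funds agree on every slot where both have a value, the similarity is 1.0 even though one record has a missing value.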
The present invention utilizes a definition of link(p_i, p_j)
as the number of common neighbors between points p_i and p_j
(common neighbors meaning points that are neighbors to both
points p_i and p_j). From the definition of links, it follows
that if link(p_i, p_j) is large (that is, there are a large
number of common neighbors between the points), then it is
highly probable that p_i and p_j belong to the same cluster.
The method of the present invention exploits this property
when making decisions about the points to merge into a single
cluster.

This link-based approach adopts a global approach to the
clustering problem. It captures the global knowledge of
neighboring data points into the relationship between
individual pairs of points. Thus, since the clustering
procedure of the present invention utilizes the information
about links between points when making decisions on the points
to be merged into a single cluster, the method for clustering
is very robust.

Links between a pair of points, in effect, are the number of
distinct paths of length 2 between points p_i and p_j such that
every pair of consecutive points on the path are neighbors.
Alternative definitions of links, based on paths of length 3 or
more, are also possible, although they include additional
information that may not be desired and are less efficient
timewise.

The method of the present invention utilizes a criterion
function to characterize the best clusters. It is desired that
each cluster has a high degree of connectivity; therefore,
there is the need to maximize the sum of link(p_q, p_r) for
data point pairs p_q and p_r belonging to a single cluster and,
at the same time, minimize the sum of link(p_q, p_r) for pairs
in different clusters. This leads to the following criterion
function that is to be maximized for the k desired clusters:

$$E_l = \sum_{i=1}^{k} n_i \cdot \sum_{p_q, p_r \in C_i} \frac{\mathrm{link}(p_q, p_r)}{n_i^{1+2f(\theta)}}$$

where C_i denotes cluster i of size n_i and f(θ) is a function
that is dependent on the data set, as well as the kind of
clusters desired, and has the following important property:
each point belonging to cluster C_i has approximately
n_i^{f(θ)} neighbors in C_i.

One possibility for f(θ) is

$$f(\theta) = \frac{1-\theta}{1+\theta}$$

for market basket data. This can be informally derived under
the simplifying assumptions that transactions are of
approximately the same size t and are uniformly distributed
amongst the m items purchased by customers in cluster C_i. For
some constant c ≤ 1, the number of transactions in the cluster
is approximately

$$\binom{m}{t}^{c}$$

and the number of transactions whose similarity to a
particular transaction T_i exceeds θ is approximately

$$\binom{m}{\frac{(1-\theta)t}{1+\theta}}^{c}.$$

Thus, the number of neighbors for a transaction in C_i is
approximately

$$n_i^{\frac{1-\theta}{1+\theta}} \quad \text{and} \quad f(\theta) = \frac{1-\theta}{1+\theta}.$$

The criterion function is used to estimate the goodness of
clusters. The best clusterings of points are those that result
in the highest values for the criterion function. Since the
goal of the present invention is to perform clustering that
maximizes this criterion function, the present invention
utilizes a goodness measure, similar to the criterion function,
to determine the best pair of clusters to merge. Briefly, the
goodness measure is based upon the total number of cross links
(that is, links between a cluster and another cluster) and an
estimated number of cross links (that is, an estimated number
of links between a cluster and another cluster). For a pair of
clusters C_i, C_j, let link[C_i, C_j] store the number of cross
links between clusters C_i and C_j, that is,

$$\mathrm{link}[C_i, C_j] = \sum_{p_q \in C_i,\; p_r \in C_j} \mathrm{link}(p_q, p_r).$$

Then, the goodness measure g(C_i, C_j) for merging clusters
C_i, C_j is defined as follows:

$$g(C_i, C_j) = \frac{\mathrm{link}[C_i, C_j]}{(n_i + n_j)^{1+2f(\theta)} - n_i^{1+2f(\theta)} - n_j^{1+2f(\theta)}}$$

The pair of clusters for which the above goodness measure is
maximum is the best pair of clusters to be merged at any given
step.
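
As an illustration only, the goodness measure can be sketched in
Python as follows. The sketch assumes the market basket choice
f(θ) = (1-θ)/(1+θ) and a denominator of the form
(n_i+n_j)^(1+2f(θ)) - n_i^(1+2f(θ)) - n_j^(1+2f(θ)); the function
names are hypothetical and are not taken from the specification.

```python
def f_market_basket(theta: float) -> float:
    """Assumed form of f(theta) for market basket data."""
    return (1.0 - theta) / (1.0 + theta)


def goodness(cross_links: int, n_i: int, n_j: int, theta: float) -> float:
    """Observed cross links divided by the estimated number of cross links.

    cross_links is link[C_i, C_j]; n_i and n_j are the cluster sizes.
    """
    exponent = 1.0 + 2.0 * f_market_basket(theta)
    expected = (n_i + n_j) ** exponent - n_i ** exponent - n_j ** exponent
    if expected <= 0.0:
        # Happens when theta == 1 (exponent 1), where the estimate vanishes.
        raise ValueError("theta must be below 1 for a positive estimate")
    return cross_links / expected


if __name__ == "__main__":
    # The same number of cross links counts for less between larger clusters.
    print(goodness(4, 1, 1, 0.5), goodness(4, 5, 5, 0.5))
```

Dividing by the estimated number of cross links keeps large
clusters from absorbing every other cluster merely because they
have more points and therefore more links.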

Referring now to FIG. 4, a detailed description of the
clustering method of the present invention now follows (this
embodiment performs step 106 illustrated in FIG. 2). The
input parameters of the method are the input data set S
containing n points to be clustered and the number of desired
clusters k. Initially, each point is considered to be a separate
cluster i. In step 200, the procedure begins by computing the
number of links between pairs of points within the data set
S. The method used to compute the number of links between
the points is described in detail below (FIG. 5).
In step 202, a local heap q[i] is created for each cluster i.
Therefore, at the initial stages of the clustering method of the
present invention, there is a local heap q[i] for each data
point in the data set S. Each local heap q[i] contains every
cluster j for which link[i,j] is non-zero (that is, clusters i
and j share common neighbors). Link[i,j] has been initialized
by the link computation procedure described below (FIG. 5).
The clusters j in q[i] are ordered in the decreasing order of
the goodness measure (as defined above) with respect to i,
designated as g(i, j).

In addition to the local heaps q[i] for each cluster i, the
method of the present invention also maintains a single
global heap Q that contains all of the clusters. The global
heap Q is initialized in step 204 as follows. The clusters
contained within the global heap Q are ordered in the
decreasing order of their best goodness measures. Thus,
g(i, max(q[i])) is used to order the various clusters i in the
global heap Q, where max(q[i]), the max element in the local
heap q[i], is the best cluster to merge with cluster i. As will
be discussed below, during each step of the method of the
present invention, the max cluster i in the global heap Q and
the max cluster in the local heap q[i] are the best pair of
clusters to be merged.

Once the global heap Q and the local heaps q[i] are
initialized, the method loops between steps 206 and 224
(main loop). The looping continues until a predetermined
termination condition is met. Step 206 illustrates that the
predetermined termination condition allows looping until a
predetermined desired number of k clusters remain in the
global heap Q, at which point the method is completed (step
230). In addition, another predetermined termination
condition allows the clustering method to terminate if the
number of links between every pair of the remaining clusters
becomes zero (indicating that the remaining clusters do not
share any common neighbors and that they should not be
merged together).

The cluster at the top of the global heap Q is the cluster
with the best goodness measure. Thus, at step 208, the max
cluster u is extracted from and deleted from the global heap
Q. At step 210, the local heap q[u] for the extracted cluster u
is used to determine the best cluster v for cluster u. Since
clusters u and v will be merged, the entry for cluster v is also
deleted from the global heap Q in step 210. Clusters u and v
are then merged in step 212 to create a new cluster w
containing all of the points from clusters u and v.

Once the new cluster w is created, the method of the
present invention loops between steps 214 and 222
(secondary loop) to (1) replace the clusters u and v with the
new merged cluster w for every cluster i that contains u or
v in its local heap q[i], and (2) create a new local heap q[w]
for cluster w. The looping continues until clusters u and v are
replaced by the new merged cluster w for every cluster i that
contains u or v in its local heap q[i], at which point the
secondary loop is completed and the method continues at
step 224.

At step 214, a cluster x is selected from the union of the
clusters contained within the local heap q[u] of cluster u and
the local heap q[v] of cluster v. At step 216, the number of
links between clusters x and w is calculated by adding the
number of links between clusters x and u to the number of
links between clusters x and v. Additionally, in step 216,
clusters u and v are deleted from the local heap q[x] of
cluster x.

In step 218, a new goodness measure g(x, w) for the pair
of clusters x and w is computed and the two clusters are
inserted into each other's local heaps q[x] and q[w]. It must
be noted that q[w] can only contain clusters that were
previously either in q[u] or q[v] since these are the only
clusters that have non-zero links with cluster w.

In step 220, the global heap Q is updated with cluster x,
since as a result of merging clusters u and v, it is possible that
clusters u or v were previously the best to be merged with
x and now w becomes the best cluster for being merged with
other clusters i. Furthermore, it is also possible that neither
cluster u nor v was the best cluster to merge with x, but now
w is a better cluster to merge with x. For such cases,
whenever the max cluster in the local heap q[x] for x
changes, cluster x must be relocated within the global heap
Q.

When all of the clusters x, in the union of the clusters
within the local heaps q[u] and q[v] of the merged clusters
u and v, have been examined in the secondary loop (steps
214 to 222), the method continues at step 224. At this point,
clusters u and v are replaced by the new merged cluster w
for every cluster i that contained u or v in its local heap q[i].
In step 224, the merged cluster is inserted into the global
heap Q and the method continues back at step 206 for
another pass through the main loop (steps 206 to 224). The
loop completes (step 230) when the number of remaining
clusters contained within the global heap Q matches the
desired number of clusters k. Alternatively, the loop is
completed (step 230) when the number of links between
every pair of the remaining clusters becomes zero
(indicating that the remaining clusters have no common
neighbors and should not be merged with each other).

Referring to FIG. 5, the method of computing the number
of links between pairs of points is as follows. Briefly, a list
of neighbors is computed for every data point s in the data
set S. The method examines all the neighbors of a data point
s (the relationship between the data point and its neighbors
will be, hereinafter, referred to as a pair of points or pair of
neighbors). For each pair of neighbors of the data point, the
data point contributes one link. By repeating this process for
every point s in the data set S, and keeping track of the link
count for each pair of neighbors, the link counts for all pairs
of points of the data set S will be obtained.

The procedure for computing the number of links between
pairs of points begins by computing a neighbor list nbrlist[s]
for every data point s within the data set S (step 300). Step
302 is an initialization step in which a link counter array
link[i,j] is set to 0 for all combinations of i and j, where i and
j represent a pair of points. In step 304, loop counter i is set
to 1 in preparation of the first pass through the main loop
(steps 306 to 324) of the link computation procedure. The
main loop terminates (step 326) only after looping n times
(that is, for the number n of data points s within the data set
S).

In step 306, the neighbors N of a point (designated by loop
counter i) from data set S are retrieved from the neighbor list
nbrlist[i]. In step 308, a secondary loop counter j is set to 1
in preparation of the first pass through a secondary loop
(steps 308 to 320) of the link computation procedure. The
secondary loop terminates (step 322) only after looping for
the number of |N|-1 neighbors for the point being
considered.

In step 310 a third loop counter h is set to j+1 (that is, one
more than the secondary loop counter j) in preparation for
the following link counting steps (steps 312 to 316). In step
312, the link counter array link[i,j] is indexed by the pair of
neighbors N[j] and N[h] from the neighbors of the point
being considered. This indexing represents one pair of
neighbors N of point s (represented in the main loop by the
main loop counter i). As stated above, a pair of neighbors N
constitutes one link. Accordingly, link[N[j], N[h]] is set to
the previous value of link[N[j], N[h]]+1 to count the new
link between the pair of neighbors N[j] and N[h] of data
point s (represented by i). In step 314, the third loop counter
h is incremented in preparation of another pass through the
counting loop (312 to 316). Step 316 determines if there are
more neighbors N[h] to count links between neighbor N[j]
(that is, if the third loop counter h is ≤ the number |N| of
neighbors for the point being considered). If there are no
more neighbors N[h], the method continues at step 318.

Step 318 increments the secondary loop counter j in
preparation for another pass through the secondary loop
(steps 308 to 320). Step 320 determines if there is another
neighbor N[j] to undergo the link counting loop (that is, if
the secondary loop counter j is ≤ the number |N|-1 of
neighbors for the point being considered). If there are no
more neighbors N[j], the method continues at step 322.
Step 322 increments the main loop counter i in preparation
for another pass through the main loop (steps 306 to
324). Step 324 determines if there is another point s
(represented by main loop counter i) in the data set S to be
considered and to undergo the aforementioned link counting
loops (that is, if the main loop counter i is ≤ the number n
of data points in the data set S). If there are no more
points s, the method is completed (step 326). Having
completed, the link computation procedure has counted and
stored all the links link[i,j] between every possible pair of
points within the data set S.
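
The link computation procedure of FIG. 5 can be sketched in
Python as follows. This is a minimal illustration, not the
claimed implementation: the name compute_links, the dictionary
representation of nbrlist, and the convention that a point is
not listed among its own neighbors are all assumptions made for
the example.

```python
from collections import defaultdict
from itertools import combinations


def compute_links(nbrlist):
    """Count common neighbors for every pair of points.

    nbrlist maps each point to the list of its neighbors (assumed not to
    include the point itself). Each point contributes one link to every
    pair of its neighbors, which mirrors the nested j/h loops of FIG. 5.
    """
    link = defaultdict(int)
    for neighbors in nbrlist.values():
        # combinations() enumerates each unordered pair exactly once.
        for a, b in combinations(sorted(neighbors), 2):
            link[(a, b)] += 1
    return dict(link)


if __name__ == "__main__":
    nbrlist = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    print(compute_links(nbrlist))
```

Only pairs with at least one common neighbor appear in the
result, so storage grows with the number of non-zero links
rather than with the number of all possible pairs.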

It must be noted that the link computation can be performed
by other methods. For example, by using an n×n
adjacency matrix A (n is the number of data points s in the
data set S) in which an entry A[i,j] is either a 1 (if points
i and j are neighbors) or a 0 (if they are not), the number
of links can be obtained by multiplying the adjacency matrix
A with itself. That is, the number of links between points i
and j is entry [i,j] of A×A. This method, however, is less
efficient than the procedure illustrated in FIG. 5.
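
For comparison, the matrix alternative can be sketched as below.
The sketch uses a plain triple loop in Python rather than a fast
matrix multiplication routine, and the function name is
hypothetical.

```python
def links_by_matrix_square(adj):
    """Return A x A for a 0/1 neighbor adjacency matrix A (list of lists).

    Entry [i][j] of the product counts the points k that are neighbors of
    both i and j, that is, the number of links between i and j.
    """
    n = len(adj)
    return [
        [sum(adj[i][k] * adj[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    adj = [
        [0, 1, 1, 0],
        [1, 0, 1, 0],
        [1, 1, 0, 1],
        [0, 0, 1, 0],
    ]
    for row in links_by_matrix_square(adj):
        print(row)
```

The diagonal of the product holds each point's neighbor count
and is not used as a link count.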

The time and space complexity of the clustering method
of the present invention, for n input data points, is now
examined. Time complexity will be expressed as an order of
magnitude O( ). A time complexity of O(n^2), for example,
indicates that if the input size n doubles then the method
requires four times as many steps to complete. Space
complexity will also be expressed as an order of magnitude
O( ). A space complexity of O(n^2), for example, indicates that
if the input size n doubles then four times as much working
storage will be required.

It is possible to compute links among pairs of points in
O(n^2.37) time using the matrix multiplication technique, or
alternatively in O(n^2 m_a) time for the procedure illustrated
in FIG. 5 (where m_a is the average number of neighbors). The
space requirement for the link computation is at most
n(n+1)/2, when every pair of points is linked. However,
generally, not every pair of points will have links between
them and the storage requirements are much smaller. It can
be shown that the space requirement is
O(min{n m_m m_a, n^2}), where m_m is the maximum number of
neighbors for a point. This holds true because a point i can
have links to at most min{n, m_m m_a} other points.

The initial time to build every local heap q[i] is O(n) (a
heap for a set of n input clusters can be built in time that is
linear in the number of clusters). The global heap Q also has
at most n clusters initially, and can be constructed in O(n)
time as well. The main loop (steps 206 to 224) is executed
O(n) times. The secondary loop (steps 214 to 222) dominates
the complexity of the clustering method of the present
invention. Since the size of each local heap q[i] can be n in
the worst case, and the new merged cluster w may need to
be inserted in O(n) local heaps q[i], the time complexity of
the secondary loop becomes O(n log n), and that of the main
loop is O(n^2 log n) in the worst case. Therefore, the entire
clustering procedure of the present invention has a
worst-case time complexity of O(n^2 + n m_m m_a + n^2 log n).
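
The merge loop of FIG. 4 analyzed above can be sketched in
Python as follows. To stay short, this sketch replaces the local
heaps q[i] and the global heap Q with a linear scan over all
linked cluster pairs, so it does not achieve the stated time
complexity. The function name rock_merge, the input format, and
the use of f(θ) = (1-θ)/(1+θ) are assumptions made for
illustration.

```python
def rock_merge(point_links, n_points, k, theta):
    """Agglomerate points 0..n_points-1 until k clusters remain.

    point_links maps a pair (a, b) with a < b to its link count.
    Requires theta < 1 so that the goodness denominator is positive.
    Stops early if no cross links remain between any two clusters.
    """
    exponent = 1.0 + 2.0 * (1.0 - theta) / (1.0 + theta)
    clusters = {i: {i} for i in range(n_points)}
    # Symmetric cross-link table: link[i][j] == link[j][i].
    link = {i: {} for i in range(n_points)}
    for (a, b), count in point_links.items():
        link[a][b] = count
        link[b][a] = count

    def goodness(i, j):
        n_i, n_j = len(clusters[i]), len(clusters[j])
        expected = (n_i + n_j) ** exponent - n_i ** exponent - n_j ** exponent
        return link[i][j] / expected

    next_id = n_points
    while len(clusters) > k:
        pairs = [(goodness(i, j), i, j) for i in link for j in link[i] if i < j]
        if not pairs:
            break  # second termination condition: no links remain
        _, u, v = max(pairs)
        w = next_id
        next_id += 1
        clusters[w] = clusters.pop(u) | clusters.pop(v)
        link[w] = {}
        for x in (set(link[u]) | set(link[v])) - {u, v}:
            # link[x, w] = link[x, u] + link[x, v], as in step 216.
            total = link[x].pop(u, 0) + link[x].pop(v, 0)
            link[x][w] = total
            link[w][x] = total
        del link[u], link[v]
    return list(clusters.values())


if __name__ == "__main__":
    links = {(0, 1): 2, (0, 2): 2, (1, 2): 2,
             (3, 4): 2, (3, 5): 2, (4, 5): 2, (2, 3): 1}
    print(rock_merge(links, 6, 2, 0.5))
```

A heap-based version would keep, for each cluster, its candidate
partners ordered by goodness, which avoids rescanning every pair
after each merge.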

The space complexity of the method of the present
invention depends on the initial size of the local heaps q[i].
The reason for this is that when two clusters are merged,
their local heaps are deleted and the size of the new cluster's
local heap can be no more than the sum of the sizes of the
local heaps of the merged clusters. Since each local heap
only contains those clusters to which it has non-zero links,
the space complexity of the clustering method is the same
as that of the link computation, that is,
O(min{n^2, n m_m m_a}).
In a second embodiment of the present invention (steps
104 to 108 illustrated in FIG. 2), enhancements are made
such that the clustering method can efficiently cluster very
large databases. In order to handle large data sets, an
effective mechanism for reducing the size of the input to the
method of the present invention is required. One approach to
achieving this is via random sampling of the input data set;
the key idea is to apply the clustering method of the present
invention to a random sample drawn from the data set rather
than the entire data set. Typically, the random sample will fit
in main memory and will be much smaller than the original
data set. Consequently, significant improvements in
execution time can be realized. Also, random sampling can
improve the quality of clustering since it has the desirable
effect of filtering outliers.

FIG. 6 illustrates the process 198 of obtaining and using
a random sampling of the input data set (step 104 illustrated
in FIG. 2). In step 198a, the size s_min of the random sample
is calculated (the formula for this calculation is described in
detail below). In step 198b, s_min random points are selected
from the data set S. These randomly selected points are then
provided as the input data set to the clustering procedure
described above (FIGS. 4-5). Efficient methods of drawing
a sample randomly from data in a file in one pass and using
constant space are well known in the art. For example, Jeff
Vitter, in Random Sampling with a Reservoir, ACM
Transactions on Mathematical Software 11(1): 37-57, 1985,
discloses a well known method of drawing a random sample
which is hereby incorporated by reference. The present
invention employs one of these well-known methods for
generating the random sample. In addition, the overhead of
generating a random sample is very small compared to the
time for performing clustering on the sample (random
sampling typically takes less than two seconds to sample a
few thousand points from a file containing a hundred thousand
or more points).
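
One-pass sampling of this kind can be sketched with the basic
reservoir scheme below. The specification does not say which
reservoir variant is used, so this particular scheme, the
function name, and the optional rng parameter are assumptions
made for illustration.

```python
import random


def reservoir_sample(stream, sample_size, rng=None):
    """Draw sample_size items uniformly at random from an iterable in one pass.

    Memory use is proportional to sample_size, not to the stream length.
    If the stream is shorter than sample_size, all items are returned.
    """
    rng = rng or random.Random()
    reservoir = []
    for index, item in enumerate(stream):
        if index < sample_size:
            reservoir.append(item)  # fill the reservoir first
        else:
            # Keep the new item with probability sample_size / (index + 1).
            slot = rng.randint(0, index)  # both bounds inclusive
            if slot < sample_size:
                reservoir[slot] = item
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample(range(100000), 5, random.Random(42)))
```

Passing a seeded random.Random instance makes a run
reproducible, which is convenient when comparing clusterings of
the same sample.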

The present invention utilizes Chernoff bounds (described
below) to analytically derive values for sample sizes s for
which the probability of missing clusters is low. It is
assumed that the probability of missing a cluster u is low if
the sample contains at least f|u| points from the cluster,
where 0≤f≤1. This is a reasonable assumption to make
since clusters will usually be densely packed and a subset of
the points in the cluster is all that is required for clustering.
Furthermore, the value of f depends on the cluster density as
well as the intercluster separation. That is, the more
well-separated and the more dense clusters become, the smaller
the fraction of the points from each cluster that is needed for
clustering.

The present invention uses Chernoff bounds to determine
the sample size s for which the probability that the sample
contains fewer than f|u| points belonging to cluster u is less
than δ, where 0≤δ≤1. Let X_j be a random variable that is
1 if the j-th point in the sample belongs to cluster u and 0
otherwise. It can be assumed that all X_1, X_2, . . . , X_s are
independent 0-1 random variables. Note that X_1, X_2, . . . ,
X_s are independent Bernoulli trials such that for 1≤j≤s

$$P[X_j = 1] = \frac{|u|}{N},$$

where N is the size of the entire data set. Thus, the number
of data points in the sample belonging to cluster u is given
by the random variable

$$X = \sum_{j=1}^{s} X_j.$$

Also, the expected value of X,

$$\mu = E[X] = E\left[\sum_{j=1}^{s} X_j\right] = \sum_{j=1}^{s} E[X_j] = \frac{s|u|}{N}.$$

Chernoff bounds state that for 0 < ε < 1, and independent
Poisson trials X_1, . . . , X_s,

$$P[X < (1-\varepsilon)\mu] < e^{-\mu\varepsilon^{2}/2} \qquad (1)$$

Thus, due to Chernoff bounds, the expression on the
right-hand side is an upper bound on the probability that the
number of u's points in the sample, X, falls below the
expected count, μ, by more than εμ. For the clustering
application of the present invention, it is desired that the
probability of the number of u's points in the sample, X,
falling below f|u| be no more than δ. Stated otherwise,

$$P[X < f|u|] < \delta \qquad (2)$$

Equations (1) and (2) can be solved to derive the following
value for the minimum sample size s. For a cluster u,
equation (2) holds if the sample size s satisfies

$$s \ge fN + \frac{N}{|u|}\log\frac{1}{\delta} + \frac{N}{|u|}\sqrt{\left(\log\frac{1}{\delta}\right)^{2} + 2f|u|\log\frac{1}{\delta}} \qquad (3)$$

Thus, based on the above equation, it can be concluded
that for the sample to contain at least f|u| points belonging to
cluster u (with high probability), the sample must contain
more than a fraction f of the total number of points. Also,
suppose u_min is the smallest cluster desired, and s_min is the
result of substituting |u_min| for |u| in the right-hand side of
equation (3). It is easy to observe that equation (3) holds for
s = s_min and all |u| ≥ |u_min|. Thus, with a sample of size
s_min it is guaranteed with a high probability, 1-δ, that the
sample contains at least f|u| points from an arbitrary cluster
u. Also, assuming that there are k clusters, with a sample size
of s_min the probability of selecting fewer than f|u| points
from any one of the clusters u is bounded above by kδ.

Equation (3) seems to suggest that the sample size
required increases linearly with the input size N. However,
this is misleading since as N increases, so does |u_min|, while
the fraction f of each cluster that is needed for clustering
decreases. Typically, there is no interest in clusters that are
very small relative to the input size since these tend to be
statistically insignificant and are considered to be outliers.
Thus, it can be assumed that the smallest cluster is some
percentage of the average cluster size; that is,

$$|u_{min}| = \frac{N}{k\rho},$$

where ρ>1. Furthermore, depending on the geometry of the
smallest cluster, in order to represent it adequately in the
sample, a constant number of points, ξ, is needed from it.
Thus,

$$f = \frac{\xi}{|u_{min}|}$$

and the sample contains

$$\frac{\xi|u|}{|u_{min}|}$$

points from an arbitrary cluster u. From the above
discussion, the following equation can be derived for the
minimum sample size.

$$s_{min} = \xi k\rho + k\rho\log\frac{1}{\delta} + k\rho\sqrt{\left(\log\frac{1}{\delta}\right)^{2} + 2\xi\log\frac{1}{\delta}} \qquad (4)$$

This implies that if only clusters whose size is greater than

$$\frac{N}{k\rho}$$

are of interest and ξ points are required from the smallest
cluster, then a sample size that is independent of the original
number of data points suffices. Preferably, for k=100, ρ=2,
ξ=10 and δ=0.001, a sample size of about 6000 is sufficient for
clustering. This is independent of N, thus making the method
of the present invention very scalable.
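
The sample size quoted above can be checked numerically with the
short sketch below. It assumes that equation (3) has the form
s ≥ fN + (N/|u|) log(1/δ) + (N/|u|) sqrt(log(1/δ)^2 + 2f|u| log(1/δ)),
that equation (4) is obtained from it by substituting
|u_min| = N/(kρ) and f = ξ/|u_min|, and that log denotes the
natural logarithm. The function names are hypothetical.

```python
import math


def sample_size_eq3(n_total, cluster_size, f, delta):
    """Right-hand side of the assumed form of equation (3)."""
    log_term = math.log(1.0 / delta)
    ratio = n_total / cluster_size
    return (f * n_total + ratio * log_term
            + ratio * math.sqrt(log_term ** 2 + 2.0 * f * cluster_size * log_term))


def min_sample_size(k, rho, xi, delta):
    """Assumed form of equation (4); note that N does not appear."""
    log_term = math.log(1.0 / delta)
    return (xi * k * rho + k * rho * log_term
            + k * rho * math.sqrt(log_term ** 2 + 2.0 * xi * log_term))


if __name__ == "__main__":
    # k=100, rho=2, xi=10, delta=0.001 gives a value close to 6000.
    print(round(min_sample_size(100, 2, 10, 0.001)))
```

Under these assumptions the two functions agree whenever the
cluster size is N/(kρ) and f is ξ divided by that size, which is
the substitution described in the text.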

Since the input to the clustering method of the present
invention is a set of randomly sampled points from the
original data set, the final k clusters involve only a subset of
the entire set of points. Therefore, a cluster labeling process
(step 108 illustrated in FIG. 2) must be performed. Referring
now to FIG. 7, the procedure 240 for assigning the
appropriate cluster labels to the remaining data points begins
by selecting a fraction of labeling points L_i from each cluster
i (step 240a). In step 240b a data point from the remaining
data set S is input. In step 240c the data point is assigned to
the cluster in which the point has the maximum number of
neighbors N_i within one of the labeling sets of points L_i.
That is, if a point has N_i neighbors in set L_i, then the point
is assigned to the cluster i for which

$$\frac{N_i}{(|L_i| + 1)^{f(\theta)}}$$

is the maximum. Preferably, f(θ) is

$$\frac{1-\theta}{1+\theta}.$$

Step 240d determines if there are any more points in the data
set S that were not used in the clustering method. If there are
more points, the labeling procedure 240 continues at step
240b; otherwise the procedure 240 is completed and all
points from the data set S have been assigned a cluster label.
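
The labeling step can be sketched in Python as below. The sketch
assumes that points are sets of items, that similarity is the
ratio of intersection size to union size, that two points are
neighbors when this ratio is at least θ, and that the score for
cluster i is N_i / (|L_i| + 1)^f(θ) with f(θ) = (1-θ)/(1+θ). The
function names and the choice to return None for a point with no
neighbors are hypothetical.

```python
def similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def label_point(point, labeling_sets, theta):
    """Return the label whose labeling points give the highest normalized
    neighbor count, or None if the point has no neighbors at all."""
    f = (1.0 - theta) / (1.0 + theta)
    best_label, best_score = None, 0.0
    for label, members in labeling_sets.items():
        # N_i: number of labeling points of this cluster that are neighbors.
        n_i = sum(1 for m in members if similarity(point, m) >= theta)
        score = n_i / (len(members) + 1) ** f
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    labeling_sets = {
        "A": [frozenset({1, 2, 3}), frozenset({1, 2, 4})],
        "B": [frozenset({7, 8, 9})],
    }
    print(label_point(frozenset({1, 2, 3, 4}), labeling_sets, 0.5))
```

Normalizing by the size of each labeling set keeps a cluster
with many labeling points from winning only because it offers
more candidates for neighbors.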

In the clustering method of the present invention, outliers
can be handled fairly effectively. The first pruning occurs by
setting a value for θ, the similarity threshold for points to be
considered neighbors. Since by definition outliers are
relatively isolated from the rest of the points, outliers will
rarely cross the similarity threshold. This immediately allows
the method of the present invention to discard the points with
very few or no neighbors because they will never participate
in the clustering. Preferably, as stated earlier, θ will have a
value between 0.5 and 0.8.

However, in some situations, outliers may be present as
small groups of points that are loosely connected to the rest
of the points in the data set. This suggests that these clusters,
formed by outliers, will persist as small clusters during the
clustering process. These clusters only participate in the
clustering procedure when the number of clusters remaining
is actually close to the number of clusters in the data.
Therefore, in a preferred embodiment of the present
invention, clustering will stop at a point when the number of
remaining clusters is a small multiple of the desired number
of clusters k. Then an outlier elimination procedure 235 is
performed as shown in FIG. 8 (performed at the conclusion
of step 106 illustrated in FIG. 2).

As illustrated in FIG. 8, in step 235a one of the remaining
clusters is selected. Step 235b determines if the cluster has
few neighbors (for example, 5 or less). If the selected cluster
has few neighbors, the procedure continues at step 235c,
where the cluster (of outliers) is deleted. Once the cluster is
deleted, the procedure continues at step 235d. If step 235b
determines that the selected cluster has more than a few
neighbors, the procedure continues at step 235d. Step 235d
determines if there are any more remaining clusters. If there
are more clusters, the procedure continues back at step 235a;
otherwise the procedure is completed, leaving remaining
clusters not comprised of outliers.

The invention is preferably carried out with a general 
purpose computer which is programmed to carry out the 
operations discussed above. However, it should be under- 
stood that one or more of the operations described may also 
be carried out in an equivalent manner by a special purpose 
computer or hardware circuits. Thus, one or more portions 
of the invention may be performed with hardware, firmware, 
software or any combination of these. 

While the invention has been described in detail in
connection with the preferred embodiments known at the
time, it should be readily understood that the invention is not
limited to such disclosed embodiments. Rather, the invention
can be modified to incorporate any number of
variations, alterations, substitutions or equivalent
arrangements not heretofore described, but which are
commensurate with the spirit and scope of the invention.
Accordingly, the invention is not to be seen as limited by the
foregoing description, but is only limited by the scope of the
appended claims.

What is claimed as new and desired to be protected by
Letters Patent of the United States is:

1. A computer based method of clustering related data 
stored in a computer database, said computer database stored 
on a computer readable medium and including a set of data
points having categorical attributes, the method comprising 
the steps of: 

a) determining all neighbors for every data point within 
said computer database; 

b) establishing a cluster for every data point in said 
computer database; 

c) determining a total number of links between each 
cluster and every other cluster based on a number of 
common neighbors between each cluster and every 
other cluster; 

d) calculating a goodness measure between each cluster 
and every other cluster based upon the total number of 
links between each cluster and every other cluster and 
an estimated number of links between each cluster and
every other cluster; 

e) merging a pair of clusters having the best goodness 
measures into a merged cluster; 
f) repeating steps c) through e) until a predetermined
termination condition is met; and 

g) storing clusters which remain after step f) in a computer 
readable medium. 

2. The method of claim 1 wherein the predetermined
termination condition is obtaining a desired number of 
clusters. 

3. The method of claim 1 wherein the predetermined 
termination condition is obtaining remaining clusters that do 
not have any links between said remaining clusters. 

4. The method of claim 1 wherein the step of determining 
all neighbors for every data point within said computer 
database is determined by calculating a similarity between 
said data points. 

5. The method of claim 4 wherein calculating the simi- 
larity between said data points comprises the steps of: 

al) calculating a similarity ratio; and 
a2) assigning said points to be neighbors if said similarity
ratio exceeds a similarity threshold. 

6. The method of claim 5 wherein said similarity threshold
is selected from the range of 0.5 to 0.8.

7. The method of claim 5 wherein the calculation of said
similarity ratio is performed by dividing an intersection of 
said data points by a union of said data points. 

8. A computer based method of clustering related data in
a large computer database, said large computer database
stored on a computer readable medium and including a set
of data points having categorical attributes, the method
comprising the steps of:

a) selecting a random set of data points from said large
computer database;

b) determining all neighbors for every data point within 
said random set of data points; 

c) establishing a cluster for every data point within said
random set of data points; 

d) determining a total number of links between each 
cluster and every other cluster based on a number of 
common neighbors between each cluster and every 
other cluster; 

e) calculating a goodness measure between each cluster
and every other cluster based upon the total number of
links between each cluster and every other cluster and
an estimated number of links between each cluster and
every other cluster;

f) merging a pair of clusters having the best goodness 
measures into a merged cluster; 

g) repeating steps d) through f) until a predetermined 
termination condition is met; and 

h) storing clusters which remain after step g) in a com- 
puter readable medium. 

9. The method of claim 8 wherein said random set of data 
points has a size determined by Chemoff bounds. 

10. The method of claim 8 further comprising the step of:

i) assigning a cluster label to said data points not included
in said random set of said data points.

11. The method of claim 10 wherein the step of assigning 
a cluster label to said data points not included in said random 
set of data points comprises the steps of: 

i1) selecting a set of labeling points for each of said
remaining clusters; and

i2) assigning a cluster label to said data points not
included in said random set of data points in which said
data points have a maximum amount of neighbors with
the labeling points of one of said clusters.

12. The method of claim 8 wherein the predetermined
termination condition is obtaining a desired number of 
clusters. 
13. The method of claim 8 wherein the predetermined
termination condition is obtaining remaining clusters that do
not have any links between said remaining clusters.

14. The method of claim 8 wherein the step of determin- 
ing all neighbors for every data point within said random set 
of data points is determined by calculating a similarity 
between said data points. 

15. The method of claim 14 wherein calculating the 
similarity between said data points comprises the steps of: 

bl) calculating a similarity ratio; and 
b2) assigning said points to be neighbors if said similarity 
ratio exceeds a similarity threshold.

16. The method of claim 15 wherein said similarity 
threshold is selected from the range of 0.5 to 0.8. 

17. The method of claim 15 wherein the calculation of 
said similarity ratio is performed by dividing an intersection 
of said data points by a union of said data points. 

18. The method of claim 8 further including the step of: 
i) eliminating clusters comprised of outliers. 

19. The method of claim 18 wherein the step of elimi- 
nating clusters comprised of outliers is performed by delet- 
ing clusters having a number of links less than a predeter- 
mined threshold from said remaining clusters. 

20. The method of claim 19 wherein the predetermined 
threshold is 5 or less. 

21. The method of claim 18 further comprising the step of:
j) assigning a cluster label to said data points not included
in said random set of said data points.

22. The method of claim 21 wherein the step of assigning 
a cluster label to said data points not included in said random 
set of data points comprises the steps of: 

j1) selecting a set of labeling points for each of said
remaining clusters; and

j2) assigning a cluster label to said data points not
included in said random set of data points in which said
data points have a maximum amount of neighbors with
the labeling points of one of said clusters.
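
The labeling steps of claim 22 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: points are sets of items, labeling points are drawn at random as a fixed fraction of each cluster, and the names `label_points`, `fraction` and `seed` are hypothetical. A point with no neighbors among any labeling points is left unlabeled (`None`), which is a choice of this sketch.

```python
import random


def similarity_ratio(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def label_points(unlabeled, clusters, threshold=0.5, fraction=0.5, seed=0):
    """Assign each unlabeled point to the cluster whose labeling points
    contain the most neighbors of that point.

    clusters maps a label to a list of points (each point a set of items).
    """
    rng = random.Random(seed)
    # Step j1: select a set of labeling points for each cluster.
    labeling = {
        label: rng.sample(members, max(1, int(fraction * len(members))))
        for label, members in clusters.items()
    }
    labels = []
    for point in unlabeled:
        # Step j2: count neighbors among each cluster's labeling points.
        counts = {
            label: sum(similarity_ratio(point, rep) > threshold for rep in reps)
            for label, reps in labeling.items()
        }
        best = max(counts, key=counts.get)
        labels.append(best if counts[best] > 0 else None)
    return labels
```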

23. A computer readable storage medium containing a 
computer readable code for operating a computer to perform 
a clustering method on a computer database, said computer 
database including data points having categorical attributes, 
said clustering method comprises the steps of: 

a) determining all neighbors for every data point within 
said computer database; 

b) establishing a cluster for every data point in said 
computer database; 

c) determining a total number of links between each 
cluster and every other cluster based on a number of 
common neighbors between each cluster and every 
other cluster; 

d) calculating a goodness measure between each cluster 
and every other cluster based upon the total number of
links between each cluster and every other cluster and
an estimated number of links between each cluster and 
every other cluster; 

e) merging a pair of clusters having the best goodness 
measures into a merged cluster; 

f) repeating steps c) through e) until a predetermined 
termination condition is met; and 

g) storing clusters which remain after step f) in a computer 
readable medium. 
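
Steps c) through f) of claim 23 can be sketched in a few lines. The claim only requires that the goodness measure use the total and an estimated number of links; the normalization below, with f(theta) = (1 - theta) / (1 + theta), is an assumption taken from the published ROCK link-based clustering algorithm. The names `count_links`, `cross_links`, `goodness` and `rock_cluster` are hypothetical, a point is not counted as its own neighbor, and `theta` must be below 1 so that the estimated link count stays positive.

```python
from itertools import combinations


def count_links(neighbors):
    """links[(i, j)] = number of common neighbors of points i and j (i < j)."""
    links = {}
    for nbrs in neighbors.values():
        # Every pair of neighbors of one point shares that point as a common neighbor.
        for i, j in combinations(sorted(nbrs), 2):
            links[(i, j)] = links.get((i, j), 0) + 1
    return links


def cross_links(ci, cj, links):
    """Total number of links between the points of two clusters."""
    return sum(links.get((min(p, q), max(p, q)), 0) for p in ci for q in cj)


def goodness(ci, cj, links, theta):
    """Cross links divided by an estimated number of cross links (assumed form)."""
    exponent = 1.0 + 2.0 * (1.0 - theta) / (1.0 + theta)
    estimated = (len(ci) + len(cj)) ** exponent - len(ci) ** exponent - len(cj) ** exponent
    return cross_links(ci, cj, links) / estimated


def rock_cluster(neighbors, k, theta):
    """Merge the best pair repeatedly until k clusters remain or no links are left."""
    links = count_links(neighbors)
    clusters = [frozenset([p]) for p in neighbors]
    while len(clusters) > k:
        best = max(combinations(clusters, 2),
                   key=lambda pair: goodness(*pair, links, theta))
        if cross_links(*best, links) == 0:
            break  # termination: no links between any remaining clusters
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
    return clusters
```

This naive loop rescans every pair of clusters on each merge; it is meant to show the order of operations, not an efficient data structure. Both termination conditions of claims 24 and 25 appear: the `k` target and the check for zero remaining links.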

24. The computer readable storage medium of claim 23 
wherein the predetermined termination condition of said 
clustering method is obtaining a desired number of clusters. 

25. The computer readable storage medium of claim 23 
wherein the predetermined termination condition of said
clustering method is obtaining remaining clusters that do not
have any links between said remaining clusters. 

26. The computer readable storage medium of claim 23 
wherein said clustering method determines all neighbors of
every data point within said database by calculating a
similarity between said data points.

27. The computer readable storage medium of claim 26 
wherein said clustering method calculates the
similarity between said data points by calculating a similarity
ratio and assigning said points to be neighbors if said
similarity ratio exceeds a similarity threshold.

28. The computer readable storage medium of claim 27
wherein said similarity threshold is selected from the range 
of 0.5 to 0.8. 

29. The computer readable storage medium of claim 27
wherein said clustering method calculates said similarity 
ratio by dividing the intersection of said data points by a 
union of said data points. 

30. A computer readable storage medium containing a
computer readable code for operating a computer to perform
a clustering method on a large database, said large database
including a set of data points having categorical attributes, 
said clustering method comprises the steps of: 

a) selecting a random set of data points from said large
computer database;

b) determining all neighbors for every data point within 
said random set of data points; 

c) establishing a cluster for every data point within said 
random set of data points; 

d) determining a total number of links between each 
cluster and every other cluster based on a number of 
common neighbors between each cluster and every 
other cluster; 

e) calculating a goodness measure between each cluster
and every other cluster based upon the total number of 
links between each cluster and every other cluster and 
an estimated number of links between each cluster and 
every other cluster; 

f) merging a pair of clusters having the best goodness
measures into a merged cluster; 

g) repeating steps d) through f) until a predetermined 
termination condition is met; and 

h) storing clusters which remain after step g) in a com- 
puter readable medium. 

31. The computer readable storage medium of claim 30
wherein said random set of data points has a size determined 
by Chernoff bounds. 
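
The claim does not restate a formula for the sample size. As a hedged illustration only, the sketch below uses one Chernoff-style bound known from the literature on sampling for clustering: the sample size needed so that, with probability at least 1 - delta, the sample contains at least a fraction `fraction` of a cluster of `cluster_size` points. The function name and all parameter names are hypothetical, and the bound in the specification may differ.

```python
import math


def min_sample_size(n_total, cluster_size, fraction, delta):
    """Smallest sample size satisfying the assumed Chernoff-style bound.

    n_total      -- number of points in the database
    cluster_size -- size of the smallest cluster of interest
    fraction     -- fraction of that cluster the sample should contain (0..1)
    delta        -- allowed probability of failing to contain that fraction
    """
    log_term = math.log(1.0 / delta)
    ratio = n_total / cluster_size
    size = (fraction * n_total
            + ratio * log_term
            + ratio * math.sqrt(log_term ** 2
                                + 2.0 * fraction * cluster_size * log_term))
    # A sample can never be larger than the database itself.
    return min(n_total, math.ceil(size))
```

Under this bound, smaller clusters or a smaller failure probability both call for a larger sample.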

32. The computer readable storage medium of claim 30 
wherein said clustering method further comprises the step 
of: 

i) assigning a cluster label to said data points not included 
in said random set of said data points. 

33. The computer readable storage medium of claim 32
wherein the step of assigning a cluster label to said data 
points not included in said random set of data points 
comprises the steps of: 

i1) selecting a set of labeling points for each of said
remaining clusters; and

i2) assigning a cluster label to said data points not
included in said random set of data points in which said 
data points have a maximum amount of neighbors with 
the labeling points of one of said clusters. 

34. The computer readable storage medium of claim 30
wherein the predetermined termination condition of said
clustering method is obtaining a desired number of clusters.

35. The computer readable storage medium of claim 30
wherein the predetermined termination condition of said 
clustering method is obtaining remaining clusters that do not 
have any links between said remaining clusters. 

36. The computer readable storage medium of claim 30 
wherein said clustering method step of determining all 
neighbors for every data point within said database is 
determined by calculating a similarity between said data 
points. 

37. The computer readable storage medium of claim 36 
wherein said clustering method calculates the similarity 
between said data points by calculating a similarity ratio and
assigning said points to be neighbors if said similarity ratio 
exceeds a similarity threshold. 

38. The computer readable storage medium of claim 37 
wherein said similarity threshold is selected from the range 
of 0.5 to 0.8. 

39. The computer readable storage medium of claim 30 
wherein said clustering method further includes the step of: 

i) eliminating clusters comprised of outliers. 

40. The computer readable storage medium of claim 39 
wherein the clustering method step of eliminating clusters
comprised of outliers is performed by deleting clusters 
having a number of links less than a predetermined threshold 
from said remaining clusters. 

41. The computer readable storage medium of claim 40 
wherein the predetermined threshold is 5 or less. 

42. The computer readable storage medium of claim 41 
wherein said clustering method further comprises the step 
of: 

j) assigning a cluster label to said data points not included 
in said random set of said data points. 

43. A programmed computer database clustering system 
comprising: 

means for determining all neighbors for every data point 
within a computer database stored on a computer 
readable medium, said computer database including
data points having categorical attributes; 

means for establishing a cluster for every data point in 
said computer database; 

means for determining a total number of links between 
each cluster and every other cluster based on a number 
of common neighbors between each cluster and every 
other cluster; 

means for calculating a goodness measure between each 
cluster and every other cluster based upon the total 
number of links between each cluster and every other 
cluster and an estimated number of links between each 
cluster and every other cluster; 

means for merging a pair of clusters having the best 
goodness measures into a merged cluster; 

means for storing clusters which remain in a computer 
readable medium. 

44. The programmed computer database clustering sys-
tem of claim 43 wherein said means for deter-
mining all neighbors for every data point within a computer
database stored on a computer readable medium calculates 
a similarity between said data points. 

45. The programmed computer database clustering sys- 
tem of claim 43 wherein said means for determining all 
neighbors for every data point within a computer database 
stored on a computer readable medium comprises:

means for calculating a similarity ratio; and

means for assigning said points to be neighbors if said
similarity ratio exceeds a similarity threshold.

46. The programmed computer database clustering sys-
tem of claim 45 wherein said means for assigning said points
to be neighbors if said similarity ratio exceeds a similarity 
threshold uses a similarity threshold selected from the range 
of 0.5 to 0.8. 

47. The programmed computer database clustering sys-
tem of claim 46 wherein said means for calculating a
similarity ratio calculates a similarity ratio by dividing an 
intersection of said data points by a union of said data points. 

48. A programmed computer database clustering system
comprising:

means for selecting a random set of data points from a 
large computer database stored on a computer readable 
medium, said large computer database including data 
points having categorical attributes; 

means for determining all neighbors for every data point 
within said large computer database; 

means for establishing a cluster for every data point in 
said computer database;

means for determining a total number of links between
each cluster and every other cluster based on a number 
of common neighbors between each cluster and every 
other cluster; 

means for calculating a goodness measure between each 
cluster and every other cluster based upon the total 
number of links between each cluster and every other 
cluster and an estimated number of links between each 
cluster and every other cluster; 

means for merging a pair of clusters having the best 
goodness measures into a merged cluster; 

means for storing clusters which remain in a computer 
readable medium. 

49. The programmed computer database clustering sys-
tem of claim 48 further comprising:

means for assigning a cluster label to said data points not 
included in said random set of said data points. 

50. The programmed computer database clustering sys- 
tem of claim 49 wherein said means for assigning a cluster 
label comprises: 

means for selecting a set of labeling points for each of said 
remaining clusters; and 

means for assigning a cluster label to said data points not
included in said random set of data points in which said
data points have a maximum amount of neighbors with 
the labeling points of one of said clusters. 

51. The programmed computer database clustering sys-
tem of claim 49 wherein said means for deter-
mining all neighbors for every data point within a computer
database stored on a computer readable medium calculates 
a similarity between said data points. 

52. The programmed computer database clustering sys- 
tem of claim 49 wherein said means for determining all 
neighbors for every data point within a computer database 
stored on a computer readable medium comprises: 

means for calculating a similarity ratio; and

means for assigning said points to be neighbors if said
similarity ratio exceeds a similarity threshold.

53. The programmed computer database clustering sys-
tem of claim 52 wherein said means for assigning said points
to be neighbors if said similarity ratio exceeds a similarity
threshold uses a similarity threshold selected from the range 
of 0.5 to 0.8. 

54. The programmed computer database clustering sys-
tem of claim 53 wherein said means for calculating a
similarity ratio calculates a similarity ratio by dividing an 
intersection of said data points by a union of said data points. 

55. The programmed computer database clustering sys- 
tem of claim 48 wherein said program further comprises: 

means for eliminating clusters comprised of outliers.

56. The programmed computer database clustering sys-
tem of claim 55 wherein said means for eliminating clusters
comprised of outliers deletes clusters having a number of
links less than a predetermined threshold from said remain- 
ing clusters. 

57. The programmed computer database clustering sys- 
tem of claim 56 wherein the program further comprises: 

means for assigning a cluster label to said data points not 
included in said random set of said data points.

* * * * *